Abstract: We propose a novel solid-state disk (SSD) architecture that utilizes a double-data-rate synchronous NAND flash interface 
for improving read and write performance. Unlike the conventional design, the data transfer rate in the proposed design is doubled in 
harmony with synchronous signaling. The new architecture does not require any extra pins with respect to the conventional architecture, 
thereby guaranteeing backward compatibility. For performance evaluation, we simulated various SSD designs that adopt the proposed 
architecture and measured their performance in terms of read/write bandwidths and energy consumption. Both NAND flash cell types, 
namely single-level cells (SLCs) and multi-level cells (MLCs), were considered. In the experiments using SLC-type NAND flash chips, 
the read and write speeds of the proposed architecture were 1.65-2.76 times and 1.09-2.45 times faster than those of the conventional 
architecture, respectively. Similar improvements were observed for the MLC-based architectures tested. It was particularly effective to 
combine the proposed architecture with the way-interleaving technique that multiplexes the data channel between the controller and 
each flash chip. For a reasonably high degree of way interleaving, the read/write performance and the energy consumption of our 
approach were notably better than those of the conventional design. 

Index Terms: Solid-state disk (SSD), Double-data rate (DDR), NAND flash memory, Interleaving 

1 Introduction 

NAND-flash-based solid-state disks (SSDs) are replacing
hard disk drives (HDDs), the mass storage device of choice
for many decades, not only in high-end servers but also in
mainstream PCs and in low-end mobile internet devices (MIDs).
The compelling reason for this change can be attributed to the
absence of mechanical moving parts in SSDs; this fact can
substantially enhance key characteristics of mass storage
devices such as read/write performance, power consumption,
weight, form factor, reliability, shock resistance, and
many others. In particular, the improved read/write
performance of SSDs is expected to narrow the so-called
CPU-IO performance gap [1], which has been a long-standing
problem for accelerating computer systems.
Due to the recent advent of multi-core CPUs, the CPU-IO
performance gap would become even wider without
a breakthrough in IO systems. Thus, read/write performance
has become one of the most important metrics
to determine the overall merit of a storage device. 

The two major components of a typical NAND-based
SSD are the following: i) a number of NAND flash
memory chips and ii) control circuitry called the SSD
controller, which manages the data transfer between the
NAND flash chips and the host machine the SSD is
attached to. The system-level read/write speed of an
SSD is often orders of magnitude faster than that of
HDDs, but this is not because the individual NAND
flash chips inside the SSD are that fast. In fact, a major
performance bottleneck in an SSD may occur due to the
latency of accessing NAND flash memory. For instance,
the time to program a flash cell is normally in the range
of hundreds of microseconds, which is several orders
of magnitude greater than the typical clock-cycle time
of the SSD controller. Thus, the SSD controller should
frequently slow down or be idle in order to keep pace
with NAND flash memory, thereby incurring a performance
loss. SSDs can be faster than HDDs because of
the various techniques employed to hide and/or reduce
the latency of sluggish NAND flash memory, as will be
surveyed shortly. 

The NAND flash access time issue has become more
critical due to the advent of multi-level-cell (MLC) flash
memory. A traditional NAND flash chip can store only
one bit per cell and is called single-level-cell (SLC)
flash memory. In contrast, MLC flash memory can store
multiple bits per cell. Thus, MLC flash memory is more
cost-effective, since it demands much less die space than
SLC flash memory in order to integrate the same capacity
using the same process technology. Unfortunately,
the MLC implementation inevitably increases the access
time. For instance, it is known that the cell program
time of MLC flash memory is approximately three times
longer than that of SLC flash memory. Nevertheless, the
adoption of MLC flash memory will rapidly grow, since
the MLC implementation can significantly lower the per-bit
cost, which is still much higher than that of HDDs. 

The competitive advantage of SSDs over HDDs can
thus be obtained by trading off the access time and the
cost of NAND flash memory in an effective manner.
This point was recognized early, and many techniques
have been proposed to alleviate the access time issue.
As detailed in Section 2.3, examples include way interleaving,
channel striping, and caching. Strictly speaking,
these techniques hide the NAND flash
access latency rather than reduce it. There exist other
approaches that target an actual reduction of the latency.
A key idea of these techniques is to replace the conventional
asynchronous NAND flash interface scheme by
a synchronous one, an idea that stems from the history
of DRAM: the initial asynchronous DRAM interface was
later replaced by faster synchronous interfaces. However,
the limitation of these approaches is that they require
additional pins, thereby causing area overhead and
incompatibility with the traditional components. 

Our approach proposed in this paper belongs to the
category of techniques to reduce the latency itself. More
precisely, the contributions of our work are two-fold.
First, we propose a novel SSD architecture that utilizes a
double-data-rate (DDR) synchronous NAND flash interface
for improving read and write performance. Unlike
the conventional design, the data transfer rate in the
proposed design is doubled in harmony with synchronous
signaling. Furthermore, the new architecture does not
require any extra pins with respect to the conventional
architecture, thereby guaranteeing backward compatibility.
Second, we thoroughly validate the performance of
our approach by simulating various SSD designs that
adopt the proposed architecture and by measuring their
read and write bandwidths as well as energy consumption.
Moreover, we show how the proposed architecture
is combined with the two most popular latency-hiding
techniques, namely way interleaving and channel striping,
for their synergistic effects on overall performance
at the SSD level. For realistic results, we consider both SLC
and MLC NAND flash memory. 

The rest of this paper is organized as follows. Section 2
introduces the basics of SSD architectures and discusses
possible options for enhancing SSD performance. This
section also provides a brief review of previous approaches
for resolving the latency issue in NAND flash
memory. In Section 3, we describe the conventional SSD
architecture that uses the single-data-rate asynchronous
NAND flash interface. The proposed SSD architecture
that utilizes the new DDR synchronous NAND flash
interface is detailed in Section 4. Finally, we provide
our experimental results in Section 5, followed by a
conclusion in Section 6. 

2 Preliminaries and Related Work 

2.1 Typical SSD Architecture 

Fig. 1 shows the architecture of a typical SSD, which is
composed of multiple NAND flash memory chips and a
controller to manage the data transfer between the host
machine and the NAND flash chips. The controller contains
various components such as a processor, random
access memory (RAM), read only memory (ROM), a host
interface, and a NAND interface. The processor governs
the controller by executing the firmware residing in
the ROM chip. Some notable tasks of the processor
include wear leveling and address translation, as will be
explained in Section 2.2.1. The NAND interface, labeled
NAND_IF in Fig. 1, communicates with the NAND
flash chips. 

Each NAND flash memory chip in the SSD architecture
is composed of a cell array, a page register, an XY
decoder, control logic, IO buffers, and latches. The cell
array stores the entire set of data, while the page register
temporarily stores one page of the data being requested
for read or write. The XY decoder decodes the address
issued by the controller, and the control logic manages
the interface with the controller. The data transfer time
from the cell array to the page register is defined as
t_R, and the time for the reverse action (i.e. the time to
transfer data from the page register to the cell array)
is called the page program time or t_PROG. Typically,
t_PROG is much larger than t_R. The data transfer time
between the page register and the IO buffer is referred to as
t_BYTE. Finally, t_REA is the data transfer time between
the IO buffer and the IO pads. More details on these timing
parameters will be presented in Table 1 in Section 3. 

2.2 Options for Improving SSD Performance 

For the SSD architecture shown in Fig. 1, the opportunities
for performance improvements can be summarized
as follows: i) to enhance the performance of NAND
flash cells, ii) to optimize the performance of the SSD
controller, iii) to use a faster interface between the SSD
and its host, iv) to accelerate the interface between the
SSD controller and the NAND flash chips, and v) to
combine these options. We survey the techniques based
on options ii)-iv). Option i) is beyond the scope of
this paper and will not be discussed further; interested
readers are directed to [2], [3], [4], [5], [6], [7], [8]. 


2.2.1 Optimizing SSD Controller 

This may be the option that has been most actively
studied. From the hardware perspective, one of the most
frequently used techniques is to increase data throughput
by parallelizing the data paths between the controller
and NAND flash chips. Such paths are called channels,
and there are largely two methods for the parallelization.
One is called channel striping, which means using
multiple channels in the NAND flash interface. The
other is called way interleaving, which multiplexes
each channel to send data in a round-robin fashion. By
exploiting these techniques, it is possible to hide much
of the latency of NAND flash memory. 

Fig. 2: An SSD architecture with 4 channels and 4 ways per channel. 

Fig. 2 is an example of the SSD architecture adopting
the techniques of channel striping and way interleaving
simultaneously. The numbers of channels and ways in
this example are both four. Of note is that channel
striping is often more costly than way interleaving,
since each channel requires a NAND interface block
and an error correction code (ECC) block. The ECC
block is essential for data reliability, especially when
MLC flash is used. Another area penalty of multi-channel
design comes from increased pin counts. Each
channel requires dedicated pins to communicate with the
dedicated NAND flash memory chips. For this reason,
the number of channels should be selected carefully in
order to achieve the required system performance within
the area budget. 
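
As an illustration of how the two parallelization methods compose, the following minimal sketch assigns consecutive logical pages to a (channel, way) pair. The policy shown (stripe across channels first, then rotate over the ways of each channel) and the names `FlashTarget` and `map_page` are assumptions made for illustration; the paper does not prescribe a particular placement policy.

```python
from typing import NamedTuple


class FlashTarget(NamedTuple):
    channel: int  # which parallel data path is used (channel striping)
    way: int      # which chip sharing that channel is used (way interleaving)


def map_page(page_index: int, num_channels: int = 4, num_ways: int = 4) -> FlashTarget:
    """Assign a logical page to a (channel, way) pair in round-robin order.

    Hypothetical policy: stripe across channels first, then rotate over ways.
    The defaults mirror the 4-channel, 4-way example of Fig. 2.
    """
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    channel = page_index % num_channels
    way = (page_index // num_channels) % num_ways
    return FlashTarget(channel, way)


if __name__ == "__main__":
    for page in range(6):
        print(page, map_page(page))
```

Under this assumed policy, consecutive pages first occupy all channels in parallel, and only then move on to the next way of each channel, so a busy chip can program while its channel serves another way.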

Another performance improvement technique from
the controller perspective is to optimize the software
called the flash translation layer (FTL) [9], [10], [11]. The FTL runs
on the processor of an SSD controller, performs the
mapping between logical and physical addresses, and
also handles important housekeeping tasks such as wear
leveling [12] and garbage collection. Wear leveling
uses all the flash cells in a chip as uniformly as
possible and plays a critical role in maintaining the initial
performance and capacity of an SSD over time, since
the lifetime of a flash cell is directly limited by the
number of times it is written. 
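
The idea of wear leveling can be sketched as follows. This is a deliberately simplified toy policy (always reuse the block with the fewest erases so far), not the algorithm of [12]; the function names are hypothetical.

```python
from typing import List


def pick_block(erase_counts: List[int]) -> int:
    """Return the index of the block with the fewest erases so far."""
    if not erase_counts:
        raise ValueError("no blocks available")
    # min() returns the first index among ties, so the choice is deterministic.
    return min(range(len(erase_counts)), key=erase_counts.__getitem__)


def write_with_wear_leveling(erase_counts: List[int], num_writes: int) -> List[int]:
    """Simulate num_writes block rewrites, each costing one erase.

    Returns the erase counts after the writes; the input list is not modified.
    """
    counts = list(erase_counts)
    for _ in range(num_writes):
        counts[pick_block(counts)] += 1
    return counts


if __name__ == "__main__":
    print(write_with_wear_leveling([0, 0, 0, 0], 8))
```

With this policy the erase counts stay as even as possible, which is the property the text identifies as critical for preserving capacity over time.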

Besides, in most commercially available SSDs, DRAM
is used as a cache buffer to hide the long access latency of
NAND flash memory. If the data requested by the host
machine happens to be found in the cache buffer, we
can completely eliminate the data access time to NAND
flash memory. 
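
The benefit of the cache buffer can be estimated with the usual weighted-average model sketched below. The model and the example latencies are assumptions for illustration only; they are not measurements from this paper.

```python
def average_access_time_us(hit_ratio: float, t_dram_us: float, t_flash_us: float) -> float:
    """Expected access time when a fraction hit_ratio of requests hits the DRAM cache.

    A hit costs only the DRAM access time; a miss costs the flash access time.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in [0, 1]")
    return hit_ratio * t_dram_us + (1.0 - hit_ratio) * t_flash_us


if __name__ == "__main__":
    # Hypothetical latencies: 1 us for DRAM, 25 us for a flash page fetch.
    for ratio in (0.0, 0.5, 0.9):
        print(ratio, average_access_time_us(ratio, 1.0, 25.0))
```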

Refer to Sections 2.3.1 and 2.3.2 for a brief survey of 
the existing approaches that belong to this category. 

2.2.2 Improving Host Interface 

This option is to increase the bandwidth between the
SSD and its host machine. Currently, SSDs are attached
to the host machine via legacy interfaces inherited from
HDDs such as parallel advanced technology attachment
(PATA) and serial-ATA (SATA) [13]. To achieve higher
performance with fewer pins, SATA is rapidly replacing
PATA these days both for HDDs and SSDs. In
addition, to properly handle the increased bandwidth of
SSDs, alternative high-speed interfaces such as peripheral
component interconnect express (PCIe) have been
tried for interfacing SSDs. Recently, it was proposed in
[14] to attach SSDs to the North Bridge chipset using
the DRAM interface, instead of using the South Bridge
chipset in which the SATA and PATA controllers reside. 

2.2.3 Accelerating NAND Flash Interface 

This is to increase the bandwidth between the controller
and each NAND flash memory chip. Even though the
objective of this option is similar to that of channel striping
or way interleaving, this option is more aggressive
in the sense that the read and write bandwidths can be
improved by reducing the latency directly, rather than
hiding it. A key technique in this category is to make
the NAND flash interface scheme synchronous.
Section 2.3.3 presents more details of existing
techniques for accelerating NAND flash interfaces. 

2.3 Related Work 

2.3.1 Hiding the Latency of NAND Flash Memory 

The effect of channel striping and way interleaving was
extensively studied in [15], which used a 2-channel,
4-way-interleaving interface scheme with a software
architecture adopting a hybrid-mapping algorithm. The
proposed system outperformed the compared HDD by
77%. The improvement was mainly due to the increased
parallelism and the interleaved accesses when programming
NAND flash memory. However, the limitations
of this approach include area overhead and complicated
controller design due to the increased number
of channels. Other approaches to latency hiding include
the techniques proposed in [16], [17], where DRAM
was used as the cache buffer for NAND flash memory.
When a cache hit occurs, the data access time is solely
determined by the DRAM access time, which is much
smaller than the flash access time. 

2.3.2 Optimizing the Firmware of SSD Controller 

The techniques in this category aim at enhancing the
SSD performance by reducing the data transfer size,
operating time, and the number of extra operations
required for wear leveling. The techniques presented in
[18], [19], [20] compress the data from the host unit to
save the storage space in NAND flash memory and to
reduce the data transfer time from the controller to flash
chips. However, this method may incur extra time and
area overheads for data (de)compression. The hybrid-mapping
technique proposed in [9] aimed at improving
the write speed by introducing two types of logical
blocks called data blocks and log blocks. The number of
log blocks is much smaller than that of data blocks, and
data is always written to log blocks first. When all log
blocks are used up, the FTL moves the data from log
blocks to data blocks. This technique may incur extra
computation overhead but can be beneficial for quick
search owing to the small number of log blocks. The
techniques introduced in [10], [11], [21], [22] can reduce
the number of erase operations by using a page-map
cache and smart mapping strategies; it was shown that
the system performance can be enhanced by reducing
the number of erase and garbage collection operations. 
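
To make the log-block idea concrete, the following toy model captures only the behavior described above: writes always land in a small log area, and the log is merged into the data area when it fills up. It is a simplified sketch, not the scheme of [9]; the class `LogBlockFTL` and its page-granular bookkeeping are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


class LogBlockFTL:
    """Toy model: writes go to a small log area that is merged when it is full."""

    def __init__(self, log_capacity: int) -> None:
        if log_capacity < 1:
            raise ValueError("log_capacity must be at least 1")
        self.log_capacity = log_capacity
        self.log: List[Tuple[int, bytes]] = []  # (logical page, payload), oldest first
        self.data: Dict[int, bytes] = {}        # merged contents of the data blocks
        self.merges = 0                         # number of log-to-data merges so far

    def write(self, page: int, payload: bytes) -> None:
        """Append the write to the log, merging first if the log is full."""
        if len(self.log) >= self.log_capacity:
            self._merge()
        self.log.append((page, payload))

    def read(self, page: int) -> Optional[bytes]:
        """Search the (small) log newest-first, then fall back to the data area."""
        for logged_page, payload in reversed(self.log):
            if logged_page == page:
                return payload
        return self.data.get(page)

    def _merge(self) -> None:
        """Move log entries to the data area, oldest first so the newest copy wins."""
        for page, payload in self.log:
            self.data[page] = payload
        self.log.clear()
        self.merges += 1
```

Because the log is small, the newest copy of a page is found after scanning only a few entries, which reflects the quick-search benefit noted in the text; the merge step corresponds to the extra overhead.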


2.3.3 Improving Controller-Flash Interface 

In [23], the authors introduced a synchronous NAND
flash interface using a signal called data valid strobe
(DVS). This interface reduced the sensitivity to
process, voltage, and temperature (PVT) variations and improved the
read performance by isolating the timing of the controller
from that of the NAND flash memory. However,
this approach exploited only one edge of the clock
signal, producing limited performance improvements.
The focus of this work was more on desensitizing PVT
variations than on boosting read and write performance. 

Recently, some leading companies in the SSD business
organized an initiative called open NAND flash interface
(ONFI) and proposed a DDR flash interface scheme,
whose specification is available at [24]. Additionally,
the authors in [25] proposed a similar concept along
with a new SSD architecture. However, these approaches
require additional pins, thus causing compatibility issues
and area overhead. Furthermore, no quantitative analysis
was performed to prove the effectiveness of these
approaches and to show the impact of DDR interface
schemes on the SSD performance. 

Our work presented in this paper belongs to the
category of techniques to accelerate the interface between
the SSD controller and NAND flash chips. Unlike
the aforementioned approaches, our DDR synchronous
interface scheme provides pin-level compatibility with
the traditional NAND flash memory interface. Moreover,
we evaluate the effect of the proposed technique
quantitatively with respect to various architectural choices
(e.g. the number of channels and ways) from the SSD
perspective. 


3 Conventional Asynchronous NAND 
Flash Interface for Solid-State Disks 

The overall structure of a typical SSD was explained in
Section 2. In this section, we present additional details
on the conventional method for interfacing the controller
and the NAND flash chips in SSDs. The material in this
section is crucial for understanding the new interface
architecture proposed in Section 4. The major difference
between the two architectures lies in the controller-flash
interface; the conventional interface uses an asynchronous
single-data-rate scheme, whereas the proposed design
utilizes a synchronous double-data-rate scheme. 

3.1 Block Diagram and Key Components 

Fig. 3 shows the conventional asynchronous interface
architecture. Note that only the NAND_IF block is drawn
inside the controller block for clarity, although there
exist additional blocks, as shown in Fig. 1. The NAND_IF
block and the NAND flash chip communicate over three
types of ports. The upper two ports are for transferring
data strobe signals, and the lower one is for exchanging
all the other control signals as well as data. 

Inside the NAND_IF block, there are two blocks called
generate write (Gen_W) and generate read (Gen_R). The
signal to control writes is called write enable bar (WEB)
and is generated by the Gen_W block. The read control
signal is named read enable bar (REB) and is produced by
the Gen_R block. WEB and REB are sent to the NAND
flash chip via the upper two ports of the interface. The
D_CON block delays the clock (CLK) so that data
transfers at the interface can fulfill any given timing
specifications. The blocks called WFIFO and RFIFO are
for buffering data from and to the host, respectively. 

The IO latches inside the flash chip include timing-critical
parts called write latch (WLAT) and read latch
(RLAT). WLAT temporarily stores the data from the controller
to the page register, whereas RLAT temporarily
stores the data from the page register to the controller. 

3.2 Timing Parameters 

To explain the write and read operations of the SSD interface
architecture in Sections 3.3 and 3.4, we first show
in Table 1 a number of important timing parameters for
the interface building blocks. In the table, note that the
first eight parameters are common to the conventional
and the proposed interfaces. The next four are only for
the conventional architecture; the rest are only for the
proposed architecture detailed in Section 4. Additional
timing parameters of NAND flash chips themselves are
available in [26], [27], [28]. 

3.3 Write Operation and Timing 

Fig. 4(a) shows the write timing diagram of the conventional
NAND flash memory interface. The controller
asserts WEB and issues the first write command (CMD)
to the flash chip in order to initiate a write operation.
The destination addresses are then sent to the flash chip,
followed by a series of data to be written to the page
register through WLAT at every t_WC, the period of WEB.
Finally, the controller issues a program CMD to transfer
the data in the page register to the cell arrays of the flash
chip. During the program phase, the flash memory chip
enters the busy state and cannot be interrupted until the
end of the program phase. This time duration is defined
as t_PROG and is normally very long. 

Fig. 3: Block diagram of the NAND flash memory interface in the conventional SSD architecture. 

TABLE 1: Timing parameters for the conventional and proposed interface architectures. 

Parameter | Conventional (Fig. 3) | Proposed (Fig. 5)
t_P | Clock (CLK) period | Same as conventional
t_D | Delay amount of CLK by D_CON (i.e. difference between CLK and DCLK); t_D = α · t_P, where 0 < α < 1/2 | Same as conventional
t_S/t_H | Setup/hold time of WFIFO and RFIFO | Same as conventional
t_R | Data fetch time (from Cell Array to Page Register) | Same as conventional
t_PROG | Program time (from Page Register to Cell Array) | Same as conventional
t_BYTE | Data transfer time between Page Register and WLAT/RLAT | Same as conventional
t_WC | Write cycle time (i.e. one cycle of WEB) | Same as conventional
t_RC | Read cycle time (i.e. one cycle of REB) | Same as conventional
t_IN | Data propagation time between the IO pad of the controller and WFIFO/RFIFO | N/A
t_OUT | Signal propagation time from FFs of the controller to the strobe pads of NAND flash memory | N/A
t_DS/t_DH | Setup/hold time of IO signals with respect to WEB | N/A
t_REA | Data transfer time from RLAT to the IO pad of the controller | N/A
t_DIFF | N/A | Difference between the arrival time of DVS at RFIFO and the arrival time of IO in the NAND flash at RFIFO
t_DLL | N/A | Time delay by DLL as defined in Eq. (2)
t_RWEBD | N/A | Propagation delay of RWEB from the strobe port of NAND flash memory to DLL
t_IOS/t_IOH | N/A | Setup/hold time of IO signals with respect to DVS
t_IOD | N/A | Data propagation delay from RLAT to the IO pad of NAND flash memory
t_RWC | N/A | One cycle of RWEB; replaces t_RC and t_WC

Note that, in the write mode, both control (i.e. WEB)
and data are concurrently transferred from the controller
to the flash chip; the delays of the control and data
paths are almost identical. The conventional interface
operates synchronously in the write mode in the sense
that transfers are synchronized to the periodic WEB
signal under the timing constraints set by t_DS and t_DH.
The data transfer rate in the write mode can therefore be
improved by increasing the frequency of WEB. However,
the conventional interface is not considered synchronous
due to the asynchronous read mode, as will be explained
next. 
Fig. 4: The timing diagrams of the conventional asynchronous NAND flash memory interface. 


3.4 Read Operation and Timing 

The timing diagrams for the read operation are shown in
Fig. 4(b). After the first read CMD and the destination
address are issued, the second read CMD is issued
to the flash chip. The chip then enters the busy state for fetching
data from the cell arrays to the page register. This data
fetching time is defined as t_R, which is much shorter
than t_PROG. Thus, the data transfer time between the cell
arrays and the page register is not as critical in the read
mode as it is in the write mode. At the completion
of the fetch, the flash chip enters the ready state, and
the controller periodically asserts REB to the flash chip
with the period of t_RC. For each REB cycle, the control
logic inside the flash chip instructs a single data transfer
from the page register to RLAT within t_BYTE, and the
data reach the IO ports of the controller within t_REA.
The controller then fetches the data into RFIFO at the
positive edge of DCLK, a version of CLK delayed by t_D.
More precisely, t_D is defined as 

$$ t_D = \alpha \cdot t_P, \tag{1} $$

where 0 < α < 1/2. Note that DCLK is used to satisfy
the setup time constraint imposed on RFIFO. Without
DCLK, the system may easily violate the timing constraint
due to the variations of t_IN, t_OUT, and t_REA.
Thus, each operation of propagating REB and fetching
data is allowed to take at most t_RC + t_D, instead of t_RC. 

It is critical to notice the following: in the read mode
of the conventional interface, the control (i.e. REB) and
data cannot be propagated concurrently, unlike the write
mode. That is, REB is first propagated from the controller
to the flash chip, and then the data transfer occurs in
the opposite direction. Consequently, a single read cycle
should be determined by the sum of the propagation
delays of REB and data, unlike the write mode, in which
a write cycle can be set by the maximum of the two
delays. For this reason, t_RC is normally longer than
t_WC, although the specification of commercial NAND
flash memory usually lists identical timing parameters
for convenience. The new interface architecture proposed
in the next section focuses on reducing the read cycle
time in order to enhance read performance. 
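
The difference between the two modes can be illustrated with a small numerical sketch. The functions below only restate the argument above (sum of delays for reads, maximum of delays for writes); the delay values used in the example are hypothetical and are not taken from any datasheet.

```python
def min_write_cycle_ns(control_delay_ns: float, data_delay_ns: float) -> float:
    """Write mode: control (WEB) and data travel concurrently in the same
    direction, so the cycle is bounded by the slower of the two paths."""
    return max(control_delay_ns, data_delay_ns)


def min_read_cycle_ns(control_delay_ns: float, data_delay_ns: float) -> float:
    """Conventional read mode: REB must reach the flash chip before data can
    travel back, so the two delays are serialized within one cycle."""
    return control_delay_ns + data_delay_ns


if __name__ == "__main__":
    # Hypothetical path delays in nanoseconds.
    control, data = 10.0, 12.0
    print("write cycle >=", min_write_cycle_ns(control, data))
    print("read cycle  >=", min_read_cycle_ns(control, data))
```

For any positive delays the read bound exceeds the write bound, which is consistent with the observation that t_RC is normally longer than t_WC.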

4 Proposed DDR Synchronous NAND 
Flash Interface for Solid-State Disks 

In this section, we provide the details of the proposed
NAND flash interface for improving SSD performance.
This new architecture focuses on enhancing the data
throughput between NAND flash memory chips and
the SSD controller. To this end, the proposed scheme
operates in a synchronous manner for both read and
write modes and supports double-data-rate transfers. 

As highlighted in Section 3, a major performance bottleneck
in the conventional NAND flash memory is the
serialized, opposite-directional propagation of control
and data in the read mode. The proposed interface breaks
this serialized propagation path into two smaller ones,
one for control and the other for data, and isolates
them from the perspective of timing. More precisely, the
REB control is generated by CLK and is propagated just
as in the conventional architecture. On the other hand,
the data is fetched from the flash chip to the controller
in synchronization with a new control signal named data
valid strobe (DVS), as depicted in Fig. 5. DVS is a data
strobe asserted by the flash chip and can be considered
as a data clock whose edges indicate stable points for
data fetching. 

Fig. 5: Block diagram of the proposed double-data-rate synchronous NAND flash memory interface. 

DVS is introduced to enable the synchronous read operation.
To support DDR operation, we duplicate the RFIFO
and WFIFO buffers inside the controller and the RLAT
and WLAT latches inside the flash chip. In the controller,
one pair of RFIFO and WFIFO is dedicated to the rising
edge of CLK, and the other pair to the falling edge of
CLK; in the flash chip, one pair of RLAT and WLAT is
for the rising edge of DVS, and the other pair for the
falling edge of DVS. 

The notion of DVS was first introduced in [23], but 
the purpose of that work was not to increase the data 
bandwidth but to desensitize the PVT variations as 
discussed in Section 2.3. In contrast to [23], the proposed 
design can enhance the overall read/write performance 
of an SSD by allowing double-data-rate data transfers 
between the controller and flash memory. We compare 
the performance of the interface introduced in [23] and 
that of the proposed architecture in Section 5. 

The proposed scheme differs from the popular DDR 
DRAM interface in that the proposed architecture does 
not require an additional memory clock, since REB is 
replaced by the bidirectional DVS signal. Replacing REB 
by DVS, rather than adding an extra pin, is beneficial for 
maintaining backward compatibility with conventional 
components and boards. 

Note that in the proposed architecture we rename 
WEB as RWEB, since it is used for both read and write 
modes. 

4.1 Proposed Interface Architecture 

Fig. 5 shows the block diagram of the proposed DDR
synchronous NAND flash memory interface. As stated
earlier, REB has been replaced by DVS for synchronous
operations, and the FIFOs and latches have been duplicated
for DDR operations. The multiplexers are used
inside the NAND flash chip in order to select WLAT for
writes and RLAT for reads, depending on the edge type
of RWEB. Now that RWEB is commonly used for both
read and write modes, we do not need to distinguish
t_WC and t_RC and thus use t_RWC as the common timing
parameter representing t_WC and t_RC. The D_CON and
Gen_R blocks are not required in the proposed interface
design but are included in the design shown in Fig. 5
for guaranteeing backward compatibility. 

Note that the timing-critical path in the read mode
is broken into two parts in the proposed design. One is
the path for propagating RWEB, and the other is the data
path from the NAND flash memory to the controller. The
delay of the first path determines t_RWC, since RWEB
propagates through the same path in the write mode.
Thus, t_RWC is identical to t_WC, rather than t_RC of the
conventional NAND flash memory. The delay of the
data path in the proposed architecture is shorter than
t_RC of the conventional architecture. This is because
the propagation delay of RWEB does not need to be
considered for calculating the data propagation delay.
Consequently, the proposed interface can provide higher
data throughput than the conventional one can. 

To generate DVS at a stable data point, we use a delay-locked
loop (DLL) circuit. The DLL is triggered by the data
from RLAT and generates DVS by delaying RWEB to
satisfy the setup time (t_IOS) and the hold time (t_IOH)
constraints at the input of the controller. We define the
time delay by the DLL as t_DLL, which is given by 

$$ t_{DLL} = t_{IOD,\max} - t_{RWEBD,\min} + t_{IOS}, \tag{2} $$

where t_RWEBD is the propagation delay of RWEB from
the input port of the NAND flash memory to the DLL,
and t_IOD is the data propagation delay from RLAT to
the IO pads of the NAND flash memory. Note that the
small variation in data availability can easily be adjusted
by the DLL block. 
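
As a worked instance of Eq. (2), the sketch below evaluates the DLL delay for a set of example values. The function name and the numbers are hypothetical; only the formula itself comes from the text.

```python
def dll_delay_ns(t_iod_max_ns: float, t_rwebd_min_ns: float, t_ios_ns: float) -> float:
    """Evaluate Eq. (2): t_DLL = t_IOD,max - t_RWEBD,min + t_IOS.

    t_iod_max_ns   : worst-case data delay from RLAT to the IO pads of the flash chip
    t_rwebd_min_ns : best-case RWEB delay from the flash input port to the DLL
    t_ios_ns       : setup time of IO signals with respect to DVS
    """
    return t_iod_max_ns - t_rwebd_min_ns + t_ios_ns


if __name__ == "__main__":
    # Hypothetical values in nanoseconds, chosen only to exercise the formula.
    print(dll_delay_ns(t_iod_max_ns=6.0, t_rwebd_min_ns=2.0, t_ios_ns=1.0))
```

Taking the slowest data path and the fastest strobe path makes the delayed strobe arrive no earlier than one setup time after the latest possible data, which is the intent of Eq. (2).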

4.2 Write/Read Operation and Timing 

Fig. 6 shows the write and read timing diagrams of the
proposed DDR synchronous NAND flash interface. In
the proposed interface, data is transferred at both rising
and falling edges of the RWEB signal in the write mode,
as represented in Fig. 6(a). The data transfer rate can
thus be improved by a factor of two compared with the
conventional design. In the read mode shown in Fig. 6(b),
the controller asserts RWEB, instead of REB, to the
NAND flash memory t_R after the second CMD has been
issued. At the same time, the first data is prefetched
to RLAT from the page register. The data are
then moved from RLAT to the IO ports and to the DLL
block, which delays RWEB by t_DLL for DVS generation.
Finally, the controller fetches the data at the falling edge
of DVS. For the next series of data, DVS is generated in a
similar manner, and the controller fetches at both edges
of DVS. 

The major difference between the proposed design and
the conventional one is the concurrent propagation
of control signals and data. Hence, the proposed scheme
can have a shorter read cycle than the conventional design.
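
The factor-of-two gain from transferring on both strobe edges can be sketched with a small Python model. This is an illustrative assumption-laden sketch, not the paper's simulator: the 8-bit bus width, the page size and the clock period below are hypothetical values.

```python
def page_transfer_time_ns(page_bytes: int, cycle_ns: float, ddr: bool) -> float:
    """Time to move one page over an (assumed) 8-bit bus.

    Single data rate moves one byte per clock cycle; double data rate
    moves one byte on each strobe edge, i.e. two bytes per cycle.
    """
    bytes_per_cycle = 2 if ddr else 1
    cycles = -(-page_bytes // bytes_per_cycle)  # ceiling division
    return cycles * cycle_ns


PAGE_BYTES = 2048  # hypothetical page size
CYCLE_NS = 12.0    # hypothetical clock period

sdr = page_transfer_time_ns(PAGE_BYTES, CYCLE_NS, ddr=False)
ddr = page_transfer_time_ns(PAGE_BYTES, CYCLE_NS, ddr=True)
print(f"SDR: {sdr:.0f} ns, DDR: {ddr:.0f} ns, speedup: {sdr / ddr:.1f}x")
```

At an equal clock period the model gives exactly a 2x shorter transfer; any further gain in the proposed interface comes from the shorter cycle that the concurrent propagation of control signals and data allows.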

4.3 Determining Operating Clock Period

To compare the proposed and the conventional architectures
in terms of their maximum operating frequency,
we calculate the minimum period of the system clock
(i.e. $t_{P,min}$) for each architecture.

4.3.1 Conventional Interface 

By design, $t_P$ should be at least the larger of $t_{RC}$ and
$t_{WC}$, which are the periods of REB and WEB, respectively.
Recall from Section 3.4 that $t_{RC} > t_{WC}$, since the
propagation of REB and that of data must be serialized and
happen within the same cycle in the read mode. Thus,
we can ignore $t_{WC}$ when computing $t_{P,min}$.

To determine $t_{P,min}$, we also need to consider $t_{BYTE}$,
since the data transfer between the page register and
RLAT occurs in a distinct clock cycle that precedes the
REB and data propagation. If $t_{BYTE}$ is greater than
$t_{RC}$, $t_{P,min}$ is determined by $t_{BYTE}$.
Consequently, $t_{P,min}$ is given by

$$t_{P,min} = \max\{t_{RC},\; t_{BYTE}\}. \tag{3}$$

Since RFIFO is clocked by D_CON, which delays CLK
by $t_D$, the propagation of REB and data can take longer
than $t_{RC}$, as already explained in Section 3.4. In other
words, the following equality should hold:

$$t_{RC} + t_D = t_{OUT} + t_{REA} + t_{IN} + t_S. \tag{4}$$

Plugging Eq. (4) into Eq. (3) gives

$$t_{P,min} = \max\{t_{OUT} + (t_{REA} + t_{IN} + t_S) - t_D,\; t_{BYTE}\}, \tag{5}$$

which further develops to

$$t_{P,min} = \max\left\{\frac{t_{OUT} + (t_{REA} + t_{IN} + t_S)}{1 + \alpha},\; t_{BYTE}\right\} \tag{6}$$

by applying Eq. (1) to Eq. (5). The maximum clock
frequency of the conventional design can then be
determined by Eq. (6).
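
Eq. (6) can be evaluated directly, as in the following minimal Python sketch. The function name is a hypothetical choice; the numeric inputs are the conventional-interface values reported in Table 2, with $\alpha = 0.5$ as used in Section 5.2.

```python
def t_p_min_conventional(t_out: float, t_rea: float, t_in: float,
                         t_s: float, t_byte: float, alpha: float) -> float:
    """Eq. (6): minimum clock period (ns) of the conventional interface."""
    read_path = (t_out + t_rea + t_in + t_s) / (1.0 + alpha)
    return max(read_path, t_byte)


# Timing values in ns from Table 2; alpha = 0.5 as in Section 5.2.
period = t_p_min_conventional(t_out=7.82, t_rea=20.0, t_in=1.65,
                              t_s=0.25, t_byte=12.0, alpha=0.5)
print(f"t_P,min (conventional) = {period:.2f} ns")
```

With these inputs the read-path term, not $t_{BYTE}$, determines the period, which is the situation the proposed interface is designed to remove.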

4.3.2 Proposed Interface 

For the proposed architecture, the value of $t_P$ should be
at least the larger of $t_{RWC}$ and $t_{BYTE}$, namely

$$t_{P,min} = \max\{t_{RWC},\; t_{BYTE}\}, \tag{7}$$

since $t_{RWC}$ plays the role of $t_{RC}$.

Recall that the parameters $t_{IOS}$ and $t_{IOH}$ represent
the setup and hold time constraints of data with respect
to DVS at the IO pad of the controller, respectively. By
design, $t_{RWC}$ is identical to the period of DVS, which
should be at least twice the sum of $t_{IOS}$ and $t_{IOH}$, as
shown in Fig. 7(a). In other words, $t_{P,min}$ of the proposed
architecture is given by

$$t_{P,min} = \max\{(t_{IOS} + t_{IOH}) \times 2,\; t_{BYTE}\}, \tag{8}$$

where the term $(t_{IOS} + t_{IOH})$ is doubled since the
proposed design supports DDR, and a single DVS cycle
should thus be long enough to accommodate two transfers.

The architecture shown in Fig. 5 assumes that the
controller and the NAND flash memory chips are
integrated on a single board. Thus, $t_{IOS}$ and $t_{IOH}$ are
affected by the geometric parameters of the board-level
interconnects. When the board-level design parameters
are available, we can derive an alternative representation
of $t_{P,min}$ given by

$$t_{P,min} = \max\{(t_S + t_H + t_{DIFF}) \times 2,\; t_{BYTE}\}, \tag{9}$$

where $t_S$ and $t_H$ are the setup and hold times of RFIFO,
respectively, and $t_{DIFF}$ is the difference between the
arrival time at RFIFO of DVS and that of IO from the
NAND flash memory. As informally
shown in Fig. 7(b), $t_{DIFF}$ is caused by the different
interconnect delays of DVS and IO at the board level.
In Eq. (9), note that $t_S$ and $t_H$ are independent of the
geometric parameters of the board and that $t_{DIFF}$ also
becomes a constant once the geometric parameters of the
interconnects on the board have been decided.

The maximum clock frequency of the proposed design 
can be determined from either Eq. (8) or Eq. (9). 
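
A minimal Python sketch of Eq. (9) follows. The function name is hypothetical; the first call uses the proposed-interface values listed in Table 2, and the second call uses a hypothetical smaller $t_{BYTE}$ to show when the board-level term takes over.

```python
def t_p_min_proposed(t_s: float, t_h: float, t_diff: float,
                     t_byte: float) -> float:
    """Eq. (9): minimum clock period (ns) of the proposed DDR interface."""
    ddr_window = (t_s + t_h + t_diff) * 2.0  # one DVS cycle carries two transfers
    return max(ddr_window, t_byte)


# Timing values in ns from Table 2.
period = t_p_min_proposed(t_s=0.25, t_h=0.02, t_diff=4.69, t_byte=12.0)
print(f"t_P,min (proposed) = {period:.2f} ns")

# Hypothetical faster page register (t_BYTE = 8 ns): the DDR window dominates.
period_fast = t_p_min_proposed(t_s=0.25, t_h=0.02, t_diff=4.69, t_byte=8.0)
print(f"t_P,min with t_BYTE = 8 ns: {period_fast:.2f} ns")
```

With the Table 2 values the period is bounded by $t_{BYTE}$ rather than by the interface itself.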

Fig. 6: Timing diagrams of the proposed DDR synchronous
NAND flash interface: (a) write mode; (b) read mode.

Fig. 7: Determining the minimum clock period of the
proposed architecture: (a) $t_{RWC}$ should be at least
$(t_{IOS} + t_{IOH}) \times 2$. (b) The interconnect delays for DVS
and IO are different.

5 Experimental Results

We present the results of the experiments
conducted to evaluate the performance of the new
interface architecture proposed in Section 4. In particular,
we measured the write and read bandwidths of various
SSD architectures that utilize the proposed interface
design but are based upon different architectural and
device-level choices, such as the number of channels, the
degree of way interleaving, and the type of flash cell
(i.e. SLC/MLC). In addition, we measured the energy
consumption of the proposed architecture. By comparison
with the conventional flash interface, we show how
much impact the proposed scheme has on SSD-level
performance in a variety of scenarios.

After detailing the experimental setup in Section 5.1,
we explain in Section 5.2 how the operating frequencies
of the tested architectures were determined. Section 5.3
presents the results of our experiments conducted to
evaluate the read/write performance and energy
consumption at the SSD level.

5.1 Experimental Setting 

Based upon the basic architecture shown in Fig. 1, two
versions of the SSD simulator were implemented: one for
the conventional design and the other for the proposed
design. The former employs the asynchronous interface
shown in Fig. 3, whereas the latter utilizes the DDR
synchronous interface depicted in Fig. 5. The controllers
in both simulators were synthesized with a library
built on a 130-nanometer process technology. The worst-
case condition of this library consists of an IO voltage
of 2.7 volts (V), an internal voltage of 1.35 V, and a
temperature of 125 °C. The timing parameters of the
controllers shown in Fig. 3 and Fig. 5 were extracted
using Synopsys PrimeTime® [29].

The NAND flash memory simulated in the experiments
was modeled at the behavioral level with the timing
parameters specified in [26] and [27] for SLC and MLC
implementations, respectively, except for $t_{BYTE}$.
Choosing a reasonable value of $t_{BYTE}$ is crucial for realistic
simulation results, since the maximum data transfer rate
may be directly determined by $t_{BYTE}$, as shown in
Eqs. (6) and (9). If the value of $t_{BYTE}$ is too high, then
the first terms in these equations are eclipsed by $t_{BYTE}$
due to the max{ } operator. For our experiments, the
value of $t_{BYTE}$ was chosen from [28], which contains
the specifications of OneNAND, one of the fastest (i.e. of
the smallest $t_{BYTE}$) NAND flash memories commercially
available. Note that conventional NAND flash memory
chips such as OneNAND are fabricated with only
a single metal layer due to cost issues. If an additional
metal layer were used, $t_{BYTE}$ would decrease further, and
the performance gap between the proposed and the
conventional architectures would become wider.

TABLE 2: NAND flash memory timing parameter values
used in the experiments.

Parameter     Conventional (ns)   Proposed (ns)
$t_{OUT}$     7.82                N/A
$t_{IN}$      1.65                N/A
$t_S$         0.25                0.25
$t_H$         0.02                0.02
$t_{DIFF}$    N/A                 4.69
$t_{REA}$     20                  N/A
$t_{BYTE}$    12                  12

For the workload in the experiments, we employed
widely used sequential traces that consist of 64-KB
read/write data chunks [30]. The sequential traces
represent the typical access patterns that occur when a large
volume of data is written to or read from storage based
on NAND flash memory. As the host interface, the SATA
interface¹ was used. Finally, the overall SSD system was
modeled at the behavioral level, and all the aforementioned
models were integrated using Mentor Graphics
Seamless [31].

5.2 Operating Frequency Determination 

Using the simulators we developed, the major timing
parameters of the proposed and the conventional
interface architectures were measured, as listed in Table 2.
The value of $t_{DIFF}$ was measured using Cubic Ware [32],
[33]; the difference between the loading capacitances of DVS
and IO on the board was set to 30 pF. The values of $t_S$ and
$t_H$ are identical for both architectures since they were
synthesized with the same library. Note that only the
first five parameters in the table were obtained from
measurements; the rest are from the specifications of
NAND flash chips [26], [27], [28].

For the conventional SSD, the minimum data access
period $t_{P,min}$ defined in Eq. (6) can be evaluated as

$$t_{P,min} = \max\left\{\frac{7.82 + 20 + 1.65 + 0.25}{1 + 0.5},\; 12\right\} = 19.81 \text{ nanoseconds (ns)}$$

with the value of $\alpha = 0.5$. Based on this, the
maximum data access rate of the conventional design
was set to 50 MHz. For the proposed design, Eq. (9) is
evaluated as $t_{P,min} = \max\{(0.25 + 0.02 + 4.69) \times 2,\; 12\} = 12$
ns, and the maximum data access rate of the proposed
design was set to 83 MHz.
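
The conversion from the minimum clock periods above to the operating frequencies can be checked with a short Python sketch. The helper name is hypothetical; the two periods are the values derived in this section.

```python
def max_frequency_mhz(period_ns: float) -> float:
    """Highest clock frequency (MHz) whose period is at least period_ns."""
    return 1000.0 / period_ns


for name, period_ns in (("conventional", 19.81), ("proposed", 12.0)):
    print(f"{name}: {max_frequency_mhz(period_ns):.2f} MHz")
```

Truncating these limits to whole megahertz is consistent with the 50 MHz and 83 MHz settings used in the experiments.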

5.3 SSD-Level Performance Analysis 

We compared the performance of the
SSDs designed with the proposed synchronous DDR
interface with that of the SSDs using the conventional
interface. The comparison criteria were i) the write
and read speeds, which are among the most
important performance metrics for comparing different
SSDs, and ii) energy consumption.

1. We used SATA2 or "SATA 3 Gbit/s," which supports a
bandwidth of up to 300 MB/s.

Throughout the two sets of experiments detailed in
Sections 5.3.1 and 5.3.2, we wanted to see how the
proposed architecture can guide design decisions
about the internal channel architecture; this is critical
since such decisions trade off the area against the
performance of the SSD under design.

Three different interface designs were implemented
and compared: the conventional asynchronous interface
outlined in Section 3, the synchronous (but not double-
data-rate) interface proposed in [23], and the proposed
synchronous double-data-rate interface explained in
Section 4. In this section, these designs are referred to as
CONV, SYNC_ONLY and PROPOSED, respectively.

For convenience of implementation, the SYNC_ONLY
architecture was not developed from scratch but was
derived from PROPOSED by replacing DDR transfers
with single-data-rate transfers. The operating frequency
of SYNC_ONLY was thus set to 83 MHz.

5.3.1 Architectures with Different Way Interleaving
We designed single-channel SSDs with five different
degrees of way interleaving: 1-way, 2-way, 4-way, 8-way
and 16-way. The write and read performance of
each design was then measured for the three competing
interfaces and the two flash cell types, as shown in
Fig. 8 and Table 3. The experimental results
clearly indicate that the proposed design greatly
improves the system performance in combination with the
way-interleaving technique, as detailed below.

• Case I (write, SLC): We first consider the SLC cases
shown in Fig. 8(a). For the 1-way design, the write
performance of CONV and PROPOSED is similar, the latter
being better by only 9%. This marginal improvement
originates from the fact that the data transfer time from
the SSD controller to the NAND flash memory is much
smaller than the cell program time $t_{PROG}$ of the NAND
flash memory. What PROPOSED reduces is the data
transfer time, rather than $t_{PROG}$. By Amdahl's law, the
impact of reducing the data transfer time on the overall
performance is therefore diminished by the dominant
size of $t_{PROG}$.

However, as the degree of way interleaving is
increased, the advantage of using PROPOSED becomes
more evident. For CONV, the performance gain from way
interleaving decreases as the number of ways increases,
eventually saturating at the 8-way design. In contrast,
for PROPOSED, the interleaving effect was
maintained throughout all the degrees of way interleaving.
Note that CONV achieved only about a 5x performance
gain as the number of ways changed from 1 to 16,
whereas the performance gain of PROPOSED was more
than 11x under the same condition. For the 16-way
design, PROPOSED outperformed CONV by 2.45 times.
This difference is caused by the fact that PROPOSED
enables the controller to transfer more data in a fixed
amount of time (i.e. $t_{PROG}$) than CONV.

The performance of SYNC_ONLY lay between those
of CONV and PROPOSED, as expected from the fact that
SYNC_ONLY does not support double-data-rate data
transfers.

• Case II (read, SLC): This case is shown on the
right-hand side of Fig. 8(a). The overall read performance
was higher than the write performance for all
three interfaces tested. By design, the way-interleaving
technique can be fully effective during $t_R$ in the read
mode, while it does not fully utilize $t_{PROG}$ in the write
mode. Even in this case, the way-interleaving technique
is more effective for PROPOSED, since the performance
of PROPOSED saturates at a larger degree of way
interleaving than that of CONV. Namely, PROPOSED
and CONV saturate at 4-way and 2-way interleaving,
respectively. The relative
performance of PROPOSED over CONV in the read
mode was also higher than that in the write mode for
all degrees of way interleaving. For instance, PROPOSED
outperformed CONV by a factor of 2.75 for the 16-way
design.

Fig. 8: Write/read speed of single-channel SSDs designed
with different degrees of way interleaving: (a) single-level
cell; (b) multi-level cell (see Table 3 for more details).

TABLE 3: Details of the values drawn in Fig. 8.

Cell  Mode   Way     Performance (MB/s)          Ratio
                     C†      S       P         P/S    P/C
SLC   Write  1       7.77    8.38    8.50      1.01   1.09
             2       15.22   16.59   17.52     1.06   1.15
             4       28.94   31.90   34.30     1.08   1.19
             8       39.78   55.36   63.00     1.14   1.58
             16      39.76   60.44   97.35     1.61   2.45
             Mean‡   26.29   34.53   44.13     1.16   1.42
      Read   1       27.78   36.66   47.89     1.31   1.72
             2       42.78   67.16   70.47     1.05   1.65
             4       42.75   67.13   117.68    1.75   2.75
             8       42.72   67.11   117.64    1.75   2.75
             16      42.69   67.11   117.59    1.75   2.75
             Mean    39.74   61.03   94.25     1.49   2.26
MLC   Write  1       4.43    4.55    4.65      1.02   1.05
             2       8.36    8.85    9.24      1.04   1.11
             4       15.24   16.75   18.13     1.08   1.19
             8       25.86   29.72   34.08     1.15   1.32
             16      32.45   45.99   57.23     1.24   1.76
             Mean    17.27   21.17   24.67     1.11   1.26
      Read   1       26.04   33.58   42.69     1.27   1.64
             2       41.59   60.41   77.19     1.28   1.86
             4       41.55   64.76   101.61    1.57   2.45
             8       41.52   64.75   110.56    1.71   2.66
             16      41.50   64.73   110.52    1.71   2.66
             Mean    38.44   57.65   88.51     1.49   2.21

† C: CONV, S: SYNC_ONLY, P: PROPOSED
‡ The arithmetic mean for columns 4-6; the geometric mean for
columns 7-8.


• Case III (write/read, MLC): Fig. 8(b) shows the results
for the MLC NAND flash memory design. The read time
($t_R$) and the program time ($t_{PROG}$) parameters of MLC
devices are much larger than those of SLC devices. Thus,
the effect of way interleaving on the overall performance
decreases in MLC devices for the same degree of way
interleaving. This reduction in the effectiveness of way
interleaving is larger in the write mode than in the read
mode, since $t_{PROG}$ is much larger than $t_R$. This result
indicates that the proposed interface combined with the
interleaving technique can be more effective for high-
capacity storage devices that are composed of many
MLC chips than for low-capacity storage devices. We can
also deduce that the proposed design is more advantageous
for storage devices with many low-density MLC chips
than for those with a small number of high-density
MLC chips.
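
The saturation behavior described in the three cases above can be sketched with a simple first-order model in Python. This is an illustrative assumption, not the simulator used in the paper: it assumes each chip occupies the shared channel only while a page is transferred and then programs without using the bus, and the page size, program time and transfer times below are hypothetical.

```python
def interleaved_write_mb_s(ways: int, page_bytes: int,
                           t_xfer_us: float, t_prog_us: float) -> float:
    """Steady-state write bandwidth of one channel shared by `ways` chips.

    Bandwidth grows linearly with the number of ways until the bus is
    busy all the time, after which it is capped by the transfer time.
    """
    per_chip = page_bytes / (t_xfer_us + t_prog_us)  # bytes/us equals MB/s
    bus_limit = page_bytes / t_xfer_us
    return min(ways * per_chip, bus_limit)


PAGE_BYTES = 2048       # hypothetical page size
T_PROG_US = 200.0       # hypothetical cell program time
T_XFER_SDR_US = 40.96   # hypothetical: 20 ns per byte, single data rate
T_XFER_DDR_US = 12.288  # hypothetical: 12 ns cycle, two bytes per cycle

for ways in (1, 2, 4, 8, 16):
    sdr = interleaved_write_mb_s(ways, PAGE_BYTES, T_XFER_SDR_US, T_PROG_US)
    ddr = interleaved_write_mb_s(ways, PAGE_BYTES, T_XFER_DDR_US, T_PROG_US)
    print(f"{ways:2d}-way: SDR {sdr:6.1f} MB/s, DDR {ddr:6.1f} MB/s")
```

In this model the slower bus saturates at a small number of ways while the faster bus keeps scaling, and a larger program time pushes the saturation point toward more ways, which mirrors the qualitative trends reported for CONV, PROPOSED and MLC devices.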

5.3.2 Architectures with Various Channel Configurations
In practice, the capacity of a storage system is typically
determined earlier than micro-architectural design
parameters such as the number of ways and channels.
Given a capacity value, we can explore the various
combinations of ways and channels to search for an optimal
design. In this regard, we tested three different SSD
architectures of varying channel/way configurations
(1-channel/16-way, 2-channel/8-way and 4-channel/4-
way), while keeping the product of channels and ways
constant. In other words, the number of NAND flash
chips (i.e. total capacity) used in each architecture was
kept identical. Through this experiment, we wanted
to determine the optimal number of channels and
degree of way interleaving, considering the trade-off
between performance and area. For each design, the
write and read speeds were measured for both SLC- and
MLC-based implementations. The results are shown in
Fig. 9 and Table 4.

• Case I (write, SLC): In the write mode shown in
Fig. 9(a), the performance of PROPOSED increased more
slowly than that of CONV as the area (i.e. the number of
channels) increased. In our experiment, the architectures
designed with more channels have lower degrees of way
interleaving, and thus the benefit of using PROPOSED
decreases as more channels are used. In the write mode,
it would therefore be better to increase the degree of way
interleaving than to increase the number of channels if
a tight area budget is given.

• Case II (read, SLC): Unlike in the write mode, the
performance of the three interfaces increases in an almost
identical fashion as more channels and lower degrees of
way interleaving are used. This is because the interval to
which the way-interleaving technique is applied is much
shorter in the read mode (i.e. $t_R$ in the read mode versus
$t_{PROG}$ in the write mode). Note that the read bandwidth
is much higher than the write bandwidth. Thus, the read
bandwidth of the (4-channel, 4-way) configuration in
Fig. 9(a) actually reached the bandwidth of the SATA
host interface we used.

Fig. 9: Write/read speed of SSDs designed with different
numbers of channels and degrees of way interleaving:
(a) single-level cell; (b) multi-level cell
(see Table 4 for more details).

TABLE 4: Details of the values drawn in Fig. 9.

Cell  Mode   Ch-Way  Performance (MB/s)          Ratio
                     C†      S       P         P/S    P/C
SLC   Write  1-16    39.76   60.44   97.35     1.61   2.45
             2-8     74.07   101.99  114.83    1.13   1.55
             4-4     103.76  115.68  123.52    1.07   1.19
             Mean‡   72.53   92.70   111.90    1.25   1.65
      Read   1-16    42.69   67.11   117.59    1.75   2.75
             2-8     81.44   126.70  224.82    1.77   2.76
             4-4     155.35  237.61  max§      -      -
             Mean    93.16   143.81  235.25    1.76   2.76
MLC   Write  1-16    32.45   45.99   57.23     1.24   1.76
             2-8     48.72   56.83   64.75     1.14   1.33
             4-4     57.46   63.55   68.49     1.08   1.19
             Mean    46.21   55.46   63.49     1.15   1.41
      Read   1-16    41.50   64.73   110.52    1.71   2.66
             2-8     79.32   122.48  201.42    1.64   2.54
             4-4     150.94  230.17  max       -      -
             Mean    90.59   139.13  217.18    1.68   2.60

† C: CONV, S: SYNC_ONLY, P: PROPOSED
‡ The arithmetic mean for columns 4-6; the geometric mean for
columns 7-8.
§ Reached the maximum bandwidth of the SATA interface.

• Case III (write/read, MLC): Fig. 9(b) shows the results
of simulating MLC-based SSD designs. The overall
performance pattern is similar to that appearing in
Fig. 9(a). However, the degree of performance
improvement is smaller than in the SLC case. For instance,
in the SLC-based design, the read bandwidth of
PROPOSED improved by 1.91 times as the configuration
changed from (1-channel, 16-way) to (2-channel, 8-way).
In contrast, in the MLC-based design, the read
performance of PROPOSED increased by only 1.81 times for
the same change in channel and way configuration.

This phenomenon becomes more evident in the write
mode. This is again related to the length of the period
to which the way-interleaving technique can be applied.
This period in the write mode is $t_{PROG}$, which is much
larger than its counterpart $t_R$ in the read mode. In
the write mode, a larger degree of way interleaving is
required in order to saturate the channel bandwidth.
Thus, increasing the number of channels is effective only
when the degree of way interleaving is sufficiently large.
Typically, the difference in $t_{PROG}$ between MLC and SLC
is much larger than the difference in $t_R$ between MLC
and SLC. Therefore, the performance degradation of
MLC-based SSDs is more clearly seen in the write mode.
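
The summary rows of Table 4 can be reproduced from the per-configuration values, as the following Python sketch shows for the SLC write case. The bandwidth numbers are copied from Table 4; the variable and function names are hypothetical.

```python
import math

# (CONV, PROPOSED) SLC write bandwidth in MB/s per channel-way
# configuration, copied from Table 4.
slc_write = {
    "1-16": (39.76, 97.35),
    "2-8": (74.07, 114.83),
    "4-4": (103.76, 123.52),
}


def geometric_mean(values) -> float:
    """Geometric mean, computed in log space for numerical robustness."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = {cfg: prop / conv for cfg, (conv, prop) in slc_write.items()}
conv_mean = sum(conv for conv, _ in slc_write.values()) / len(slc_write)

for cfg, ratio in ratios.items():
    print(f"{cfg}: P/C = {ratio:.2f}")
print(f"arithmetic mean of CONV = {conv_mean:.2f} MB/s")
print(f"geometric mean of P/C   = {geometric_mean(ratios.values()):.2f}")
```

The geometric mean is the usual choice for averaging ratios because it does not depend on which design is taken as the reference, whereas the bandwidth columns are averaged arithmetically.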

Fig. 10: Energy consumed by different SSD controllers to
transfer a single byte (see Table 5 for more details). Unit:
nano-Joule per byte.

5.3.3 Energy Consumption Comparison 
To see the impact of the proposed architecture on
energy consumption, we first measured the average power
consumption of the SSD controllers that adopt different
interfaces when these controllers read or write the same
amount of data. Note that the operating frequencies
of CONV, SYNC_ONLY and PROPOSED are different.
Thus, for a fair comparison, we further divided the power
consumption of an interface by the bandwidth (measured
in megabytes per second) at which this interface
operates. In other words, we compared the energy consumed
by the SSD controllers to transfer a single byte of data.

Fig. 10 and Table 5 show the results obtained from
simulating the SLC-based designs for write and read
operations at various degrees of way interleaving. For
low degrees of way interleaving, PROPOSED consumed
more energy than CONV to read or write the same
amount of data. However, as the degree of way
interleaving increased, the energy consumed by PROPOSED
gradually became the smallest among the alternatives. Due
to the performance issues discussed in Section 5.3.1,
it is likely that most SSDs will continue to be designed
with a reasonably high degree of way interleaving. For
such designs, adopting the proposed interface would be
highly beneficial, since it outperforms the alternatives
not only in terms of read/write bandwidth but also
with respect to energy efficiency.
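
The normalization described above (power divided by bandwidth) can be sketched in Python as follows. The power and bandwidth figures are hypothetical and serve only to show the unit conversion, taking 1 MB as 10^6 bytes; they are not measurements from the paper.

```python
def energy_per_byte_nj(power_mw: float, bandwidth_mb_s: float) -> float:
    """Energy needed to transfer one byte, in nJ/B.

    1 mW / (1 MB/s) = 1e-3 J/s / 1e6 B/s = 1e-9 J/B = 1 nJ/B,
    so the ratio of the two inputs is already in nJ/B.
    """
    return power_mw / bandwidth_mb_s


# Hypothetical controllers: B draws more power but moves data faster.
controller_a = energy_per_byte_nj(power_mw=20.0, bandwidth_mb_s=40.0)
controller_b = energy_per_byte_nj(power_mw=30.0, bandwidth_mb_s=100.0)
print(f"A: {controller_a:.2f} nJ/B, B: {controller_b:.2f} nJ/B")
```

The example shows why a faster interface can consume more power and still be more energy-efficient per byte once its bandwidth advantage is large enough.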

TABLE 5: Details of the values drawn in Fig. 10.

Cell  Mode   Way     Energy (nJ/B)               Ratio
                     C†      S       P         P/S    P/C
SLC   Write  1       2.90    5.01    5.47      1.09   1.89
             2       1.48    2.53    2.65      1.05   1.80
             4       0.78    1.32    1.36      1.03   1.74
             8       0.57    0.76    0.74      0.97   1.30
             16      0.57    0.69    0.48      0.69   0.84
             Mean‡   1.26    2.06    2.14      0.95   1.45
      Read   1       0.81    1.15    0.97      0.85   1.20
             2       0.53    0.63    0.66      1.06   1.25
             4       0.53    0.63    0.40      0.63   0.75
             8       0.53    0.63    0.40      0.63   0.75
             16      0.53    0.63    0.40      0.63   0.75
             Mean    0.58    0.73    0.56      0.74   0.91

† C: CONV, S: SYNC_ONLY, P: PROPOSED
‡ The arithmetic mean for columns 4-6; the geometric mean for
columns 7-8.

6 Conclusion

We have proposed a novel SSD architecture that exploits
a double-data-rate synchronous NAND flash interface.
This new design not only enhances the write and read
performance but also retains backward compatibility
with existing single-data-rate asynchronous NAND
flash memory. The performance of SSDs that exploit
the way-interleaving technique can be greatly improved
by adopting the proposed approach. Our experimental
results show that the proposed architecture outperforms
the conventional one by 1.65-2.76 times in the read mode
and 1.09-2.45 times in the write mode for the SLC-based
architectures we considered. For the MLC-based
architectures tested, the new design improves the
performance by 1.64-2.66 times in the read mode and
1.05-1.76 times in the write mode over the conventional
design. The proposed scheme can dramatically increase
the operating frequency of the interface, which is then
limited only by $t_{BYTE}$, the device-level parameter that
characterizes the read time of a flash cell. As process
technology advances, $t_{BYTE}$ will keep decreasing, and the
impact of our scheme will become more prominent.